This is problem set #2, in which we hope you will practice the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr.
Note, that this example is from the_grammar.R on http://had.co.nz/ggplot2 I’ve adapted this for psych 254 purposes
First install and load the package.
#install.packages("ggplot2")
library(ggplot2)
Now we’re going to use qplot. qplot is the easy interface, meant to replace plot. You can give it simple qplot(x,y) examples, or slightly more complex examples like qplot(x, y, col=grp, data=d).
We’re going to be using the diamonds dataset. This is a set of measurements of diamonds, along with their price etc.
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
qplot(diamonds$carat, diamonds$price)
Scatter plots are trivial, and easy to add features to. Modify this plot so that it uses the dataframe rather than working from variables in the general namespace (good to get away from retyping diamonds$ every time you reference a variable).
qplot(carat, price, data=diamonds)
Try adding clarity and cut, using shape and color as your visual variables.
qplot(x=carat, y=price, shape=cut, color=clarity, data=diamonds)
One of the primary benefits of ggplot2 is the use of facets - also known as small multiples in the Tufte vocabulary. That last plot was probably hard to read. Facets could make it better. Try adding a facets = x ~ y argument. x ~ y means row facets are by x, column facets by y.
qplot(x=carat, y=price, facets = clarity ~ cut, data=diamonds)
But facets can also get overwhelming. Try to strike a good balance between color, shape, and faceting.
HINT: facets = . ~ x puts x on the columns, but facets = ~ x (no dot) wraps the facets. These are underlying calls to different functions, facet_wrap (no dot) and facet_grid (two arguments).
qplot(x=carat, y=price, facets = clarity ~ cut, data=diamonds)
The basic unit of a ggplot plot is a “geom” - a mapping between data (via an “aesthetic”) and a particular geometric configuration on coordinate axes.
Let’s try some other geoms and manipulate their parameters. First, try a histogram (geom="hist").
qplot(x=carat, geom="histogram", data=diamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now facet your histogram by clarity and cut.
qplot(x=carat, geom="histogram", facets = clarity ~ cut, data=diamonds)
I like a slightly cleaner look to my plots. Luckily, ggplot allows you to add “themes” to your plots. Try doing the same plot but adding + theme_bw() or + theme_classic(). Different themes work better for different applications, in my experience.
qplot(x=carat, geom="histogram", facets = clarity ~ cut, data=diamonds) +
theme_bw()
ggplot is just a way of building qplot calls up more systematically. It’s sometimes easier to use and sometimes a bit more complicated. What I want to show off here is the functionality of being able to build up complex plots with multiple elements. You can actually do this using qplot pretty easily, but there are a few things that are hard to do.
ggplot is the basic call, where you specify A) a dataframe and B) an aesthetic mapping from variables in the plot space to variables in the dataset.
d <- ggplot(diamonds, aes(x=carat, y=price)) # first you set the aesthetic and dataset
d + geom_point() # then you add geoms
d + geom_point(aes(colour = carat)) # and you can keep doing this to add layers to the plot
Try writing this as a single set of additions (e.g. one line of R code, though you can put in linebreaks). This is the most common workflow for me.
ggplot(diamonds, aes(x=carat, y=price, color=carat)) +
geom_point()
You can also set the aesthetic separately for each geom, and make some great plots this way. Though this can get complicated. Try using ggplot to build a histogram of prices.
ggplot(diamonds, aes(x=price)) +
geom_histogram() +
facet_grid(clarity ~ cut) +
theme_bw()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Sklar et al. (2012) claims evidence for unconscious arithmetic processing. We’re going to do a reanalysis of their Experiment 6, which is the primary piece of evidence for that claim. The data are generously contributed by Asael Sklar.
First let’s set up a few preliminaries.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats)
theme_set(theme_bw())
sem <- function(x) {sd(x) / sqrt(length(x))}
ci95 <- function(x) {sem(x) * 1.96}
First read in two data files and subject info. A and B refer to different trial order counterbalances.
subinfo <- read.csv("../data/sklar_expt6_subinfo_corrected.csv")
d.a <- read.csv("../data/sklar_expt6a_corrected.csv")
d.b <- read.csv("../data/sklar_expt6b_corrected.csv")
Gather these datasets into long form and get rid of the Xs in the headers.
d.a_tidy <- d.a %>%
gather(subid, rt, starts_with("X")) %>%
mutate(subid = as.integer(gsub("X", "", subid)))
d.b_tidy <- d.b %>%
gather(subid, rt, starts_with("X")) %>%
mutate(subid = as.integer(gsub("X", "", subid)))
Bind these together. Check out bind_rows.
d.a.b <- bind_rows(d.a_tidy, d.b_tidy)
Merge these with subject info. You will need to look into merge and its relatives, left_join and right_join. Call this dataframe d, by convention.
d <- left_join(d.a.b, subinfo, by="subid")
Clean up the factor structure.
d$presentation.time <- factor(d$presentation.time)
levels(d$operand) <- c("addition","subtraction")
Examine the basic properties of the dataset. First, take a histogram.
qplot(rt, data=d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Challenge question: what is the sample rate of the input device they are using to gather RTs?
rt_count <- d %>%
group_by(rt) %>%
summarise(count = n())
Sklar et al. did two manipulation checks. Subjective - asking participants whether they saw the primes - and objective - asking them to report the parity of the primes (even or odd) to find out if they could actually read the primes when they tried. Examine both the unconscious and conscious manipulation checks (this information is stored in subinfo). What do you see? Are they related to one another?
# look at distributions
qplot(objective.test, facets = . ~ subjective.test, data=d)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
# are the two manipulations checks related
qplot(x=objective.test, y=subjective.test, data=d) +
stat_smooth(method="glm", family="binomial")
# model
m1 <- glm(subjective.test ~ objective.test,
family="binomial",data=d)
summary(m1)
##
## Call:
## glm(formula = subjective.test ~ objective.test, family = "binomial",
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.37183 -0.82705 -0.07408 0.66345 1.96951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8805 0.1548 -37.98 <2e-16 ***
## objective.test 9.3629 0.2526 37.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8966.6 on 6467 degrees of freedom
## Residual deviance: 6402.9 on 6466 degrees of freedom
## AIC: 6406.9
##
## Number of Fisher Scoring iterations: 5
Based on the plot and model, it looks like performance on the two manipulation checks are related: when you score higher on the objective test, you are more likely to respond “yes” on the subjective manipulation check.
OK, let’s turn back to the measure and implement Sklar et al.’s exclusion criterion. You need to have said you couldn’t see (subjective test) and also be not significantly above chance on the objective test (< .6 correct). Call your new data frame ds.
ds <- d %>%
filter(subjective.test == 0,
objective.test < .6)
# get numbers of subjects filtered
n_filt <- d %>%
group_by(presentation.time) %>%
summarise(unfiltered_n = n_distinct(subid))
n_filt <- ds %>%
group_by(presentation.time) %>%
summarise(filtered_n = n_distinct(subid)) %>%
select(filtered_n) %>%
bind_cols(n_filt) %>%
mutate(n_subs_removed = unfiltered_n - filtered_n) %>%
select(presentation.time, unfiltered_n, n_subs_removed)
n_filt
## presentation.time unfiltered_n n_subs_removed
## 1 1700 20 12
## 2 2000 22 13
Sklar et al. show a plot of a “facilitation effect” - the time to respond to incongruent primes minus the time to respond to congruent primes. They then show plot this difference score for the subtraction condition and for the two presentation times they tested. Try to reproduce this analysis.
HINT: first take averages within subjects, then compute your error bars across participants, using the sem function (defined above).
ms <- ds %>%
group_by(subid, congruent, presentation.time, operand) %>%
summarise(mean_rt = mean(rt, na.rm=T)) %>%
spread(congruent, mean_rt) %>%
mutate(mean_diff = no - yes) %>%
select(-no, -yes) %>%
group_by(operand, presentation.time) %>%
summarise(mean = mean(mean_diff),
sem = sem(mean_diff))
Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here. Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).
qplot(x=presentation.time, y=mean, geom="bar", stat="identity", width = 0.5, data=ms) +
geom_linerange(aes(ymax = mean + sem, ymin = mean - sem), width=0.2) +
facet_grid( ~ operand)
## Warning: Stacking not well defined when ymin != 0
What do you see here? How close is it to what Sklar et al. report? Do the error bars match? How do you interpret these data?
Challenge problem: verify Sklar et al.’s claim about the relationship between RT and the objective manipulation check.
ss <- ds %>%
group_by(subid, congruent, presentation.time, operand,
objective.test) %>%
summarise(mean_rt = mean(rt, na.rm=T)) %>%
spread(congruent, mean_rt) %>%
mutate(mean_diff = no - yes) %>%
select(-no, -yes)
qplot(x=objective.test, y=mean_diff,
facets = presentation.time ~ operand, data=ss) +
geom_smooth(method="lm", se=F)
Show us what you would do with these data, operating from first principles. What’s the fairest plot showing a test of Sklar et al.’s original hypothesis that people can do arithmetic “non-consciously”?
Challenge problem: Do you find any statistical support for Sklar et al.’s findings?